Abstract

This short paper describes a simple coding technique, Sparse Sequential Dirichlet
Coding, for multi-alphabet memoryless sources. It is appropriate in situations where
only a small, unknown subset of the possible alphabet symbols can be expected to
occur in any particular data sequence. We provide a competitive analysis which shows
that the performance of Sparse Sequential Dirichlet Coding will be close to that of
a Sequential Dirichlet Coder that knows in advance the exact subset of occurring
alphabet symbols. Empirically we show that our technique can perform similarly to
the more computationally demanding Sequential Sub-Alphabet Estimator, while using
fewer computational resources.

1 Introduction 


Suppose we needed to code a sequence of symbols $x_{1:n} := x_1 x_2 \ldots x_n$ from an unknown
alphabet $\mathcal{A}$ generated by an unknown memoryless data generating source $\mu$. If we knew
an alphabet $\mathcal{X}$ such that $\mathcal{A} \subseteq \mathcal{X}$, one solution would be to code the sequence using the
Sequential Dirichlet Estimator

$$\rho_{\mathcal{X}}(x_{1:n}) := \prod_{i=1}^{n} \frac{c(x_{1:i}) + \frac{1}{2}}{i + \frac{|\mathcal{X}|}{2} - 1}, \tag{1}$$

where $c(x_{1:n}) := \sum_{i=1}^{n-1} \mathbb{I}[x_n = x_i]$, as suggested by Krichevsky and Trofimov [1981]. This
technique has the property [Tjalkens et al., 1993] that

$$-\log_2 \frac{\rho_{\mathcal{X}}(x_{1:n})}{\mu(x_{1:n})} \le \frac{|\mathcal{X}| - 1}{2} \log_2 n + |\mathcal{X}| - 1. \tag{2}$$

As Equation 2 suggests, however, the performance of this particular coding technique can be
poor for small values of $n$ when $|\mathcal{A}|$ is much less than $|\mathcal{X}|$. This problem occurs often when
using context-based techniques for data compression, because for many contexts only a
small subset of the full alphabet is possible. For example, when modeling
English text it is very rare to see any character other than the letter u immediately following
the letter q. If we knew $\mathcal{A}$ in advance, we could code $x_{1:n}$ using $\rho_{\mathcal{A}}$, which from Equation 2
would give a redundancy no greater than

$$\frac{|\mathcal{A}| - 1}{2} \log_2 n + |\mathcal{A}| - 1. \tag{3}$$
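As a rough illustration of Equation 1, the following Python sketch computes the ideal code length (the negative base-2 logarithm of the assigned probability) under the Sequential Dirichlet Estimator. The function name `sdc_code_length` and the example string are illustrative assumptions and do not come from the original presentation.

```python
import math
from collections import Counter


def sdc_code_length(seq, alphabet_size):
    """Ideal code length in bits of seq under the Sequential Dirichlet
    Estimator of Equation 1, for an alphabet of the given size."""
    counts = Counter()
    bits = 0.0
    for i, sym in enumerate(seq, start=1):
        # counts[sym] plays the role of c(x_{1:i}): earlier occurrences of x_i.
        prob = (counts[sym] + 0.5) / (i + alphabet_size / 2 - 1)
        bits -= math.log2(prob)
        counts[sym] += 1
    return bits


if __name__ == "__main__":
    seq = "ququququ"  # hypothetical data containing only 2 distinct symbols
    print("alphabet of size 2: ", sdc_code_length(seq, 2))
    print("alphabet of size 26:", sdc_code_length(seq, 26))
```

Coding the same sequence with the larger alphabet size always costs more bits here, since every denominator in the product grows with the alphabet size.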



The Sequential Sub-Alphabet Estimator proposed by Tjalkens et al. [1993] provides a
natural Bayesian solution to this dilemma. Rather than using the superset alphabet $\mathcal{X}$, their
technique weights over the set of all possible Sequential Dirichlet Estimators whose alphabets
are subsets of $\mathcal{X}$. This leads to an elegant algorithm that has a coding redundancy no more
than

$$\log_2 |\mathcal{X}| + \log_2 \binom{|\mathcal{X}|}{|\mathcal{A}|} + \frac{|\mathcal{A}| - 1}{2} \log_2 n + |\mathcal{A}| + 1, \tag{4}$$

when using a uniform prior over sub-alphabets. Unfortunately this method requires $O(|\mathcal{X}|)$
time to process each new symbol, and $O(|\mathcal{X}|)$ space. This can be prohibitive in situations
where $|\mathcal{X}|$ is large. It would be better if the time and space complexity instead
depended on at most $|\mathcal{A}|$. This paper introduces a simple method, the Sparse Sequential
Dirichlet Estimator, which achieves similar redundancy properties to the Sequential
Sub-Alphabet Estimator whilst being able to process each symbol in $O(1)$ time using at most
$O(|\mathcal{A}|)$ space.



2 Preliminaries 

We begin with some notation for data generating sources. An alphabet is a finite, non-empty
set of symbols, which we will denote as either $\mathcal{A}$ or $\mathcal{X}$. A string $x_1 x_2 \ldots x_n \in \mathcal{X}^n$ of length
$n$ is denoted by $x_{1:n}$. The prefix $x_{1:j}$ of $x_{1:n}$, $j \le n$, is denoted by $x_{\le j}$ or $x_{<j+1}$. The empty
string is denoted by $\epsilon$. The concatenation of two strings $s$ and $r$ is denoted by $sr$.

A probabilistic data generating source $\rho$ is defined to be a sequence of probability mass
functions $\rho_n : \mathcal{X}^n \to [0, 1]$, for $n \in \mathbb{N}$, satisfying the constraint that

$$\rho_n(x_{1:n}) = \sum_{y \in \mathcal{X}} \rho_{n+1}(x_{1:n} y)$$

for all $x_{1:n} \in \mathcal{X}^n$, with base case $\rho_0(\epsilon) = 1$. As the meaning is always clear from the argument
to $\rho$, we drop the subscripts on $\rho$ from here onwards. Under this definition, the conditional
probability of a symbol $x_n$ given previous data $x_{<n}$ is defined as $\rho(x_n | x_{<n}) := \rho(x_{1:n}) / \rho(x_{<n})$
if $\rho(x_{<n}) > 0$, with the familiar chain rule $\rho(x_{1:n}) = \prod_{i=1}^{n} \rho(x_i | x_{<i})$ now following.
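To make the consistency constraint and the conditional probability concrete, here is a minimal Python sketch for a memoryless source over a three-symbol alphabet. The parameter table `THETA` and the function names are hypothetical, chosen only for illustration.

```python
import itertools
import math

# Hypothetical memoryless source over the alphabet {a, b, c}.
THETA = {"a": 0.5, "b": 0.3, "c": 0.2}


def rho(seq):
    """Probability of seq under the memoryless source; rho("") is 1."""
    return math.prod(THETA[s] for s in seq)


def conditional(sym, history):
    """rho(x_n | x_{<n}) := rho(x_{1:n}) / rho(x_{<n})."""
    return rho(history + sym) / rho(history)


def is_consistent(max_len):
    """Check rho(x) == sum over y of rho(x y) for all strings up to max_len."""
    for n in range(max_len + 1):
        for tup in itertools.product(THETA, repeat=n):
            x = "".join(tup)
            total = sum(rho(x + y) for y in THETA)
            if not math.isclose(total, rho(x)):
                return False
    return True


if __name__ == "__main__":
    print(is_consistent(3))
    print(conditional("b", "aa"))
```

For a memoryless source the conditional probability of a symbol does not depend on the history, which is what the second printed value reflects.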

A source code $c : \mathcal{X}^* \to \{0, 1\}^*$ assigns to each possible data sequence $x_{1:n}$ a binary codeword
$c(x_{1:n})$ of length $\ell_c(x_{1:n})$. The typical goal when constructing a source code is to minimize
the length of each codeword while ensuring that the original data sequence $x_{1:n}$ is always
recoverable from $c(x_{1:n})$. Given a data generating source $\mu$, we know from Shannon's Source
Coding Theorem that the optimal (in terms of expected code length) source code $c$ uses
codewords of length $-\log_2 \mu(x_{1:n})$ bits for all $x_{1:n}$. This motivates the notion of the redundancy
of a source code $c$ given a sequence $x_{1:n}$, which is defined as $r_c(x_{1:n}) := \ell_c(x_{1:n}) + \log_2 \mu(x_{1:n})$.
Provided the data generating source is known, near optimal redundancy can essentially be
achieved by using arithmetic encoding [Witten et al., 1987]. More precisely, using $a_\mu$ to
denote the source code obtained by arithmetic coding using probabilistic model $\mu$, the resultant
code lengths are known to satisfy

$$\ell_{a_\mu}(x_{1:n}) < -\log_2 \mu(x_{1:n}) + 2, \tag{5}$$

for all $x_{1:n}$, which implies that the redundancy is always less than 2. In practice, however,
the true data generating source $\mu$ is typically unknown. The data can still be coded using
arithmetic encoding with an alternate coding distribution $\rho$, however now we expect to use
an extra $\mathbb{E}_\mu[\log_2 \mu(x_{1:n}) / \rho(x_{1:n})]$ bits to code the random sequence $x_{1:n} \sim \mu$. From here
onwards, we restrict our attention to the problem of specifying a good coding distribution.
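The quantities above can be sketched numerically. The Python fragment below assumes, purely for illustration, that both the true source and the coding distribution are memoryless and given as dictionaries of symbol probabilities; in that special case the expected number of extra bits is $n$ times the Kullback-Leibler divergence between the two. All names are hypothetical.

```python
import math


def ideal_code_length(probability):
    """Ideal code length in bits for a sequence of the given probability."""
    return -math.log2(probability)


def redundancy(code_length, mu_probability):
    """r_c(x_{1:n}) := l_c(x_{1:n}) + log2 mu(x_{1:n})."""
    return code_length + math.log2(mu_probability)


def expected_extra_bits(mu, rho, n):
    """E_mu[log2 mu(x_{1:n}) / rho(x_{1:n})] when mu and rho are both
    memoryless: n times the KL divergence (in bits) from rho to mu."""
    kl = sum(p * math.log2(p / rho[s]) for s, p in mu.items() if p > 0)
    return n * kl


if __name__ == "__main__":
    mu = {"a": 0.5, "b": 0.5}     # hypothetical true source
    rho = {"a": 0.25, "b": 0.75}  # hypothetical coding distribution
    print(expected_extra_bits(mu, rho, 100))
```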



3 Sparse Sequential Dirichlet Distribution 

We now propose an adapted version of the Sequential Dirichlet Distribution, which
uses fewer computational resources than the Sequential Sub-Alphabet Estimator, while still
performing well in situations where $|\mathcal{A}|$ is much less than $|\mathcal{X}|$.

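Definition 1. Given an alphabet $\mathcal{X}$, for all $n \in \mathbb{N}$ and for all $x_{1:n} \in \mathcal{X}^n$, the Sparse
Sequential Dirichlet distribution $\xi : \mathcal{X}^* \to (0, 1]$ is defined as

$$\xi(x_{1:n}) := \prod_{i=1}^{n} \left( \mathbb{I}[c(x_{1:i}) = 0] \, \frac{\alpha_i}{|\mathcal{X}| - |U(x_{<i})|} + \mathbb{I}[c(x_{1:i}) > 0] \, (1 - \alpha_i) \, \frac{c(x_{1:i}) + \frac{1}{2}}{i + \frac{|U(x_{<i})|}{2} - 1} \right), \tag{6}$$

where $c(x_{1:n}) := \sum_{i=1}^{n-1} \mathbb{I}[x_n = x_i]$, $U(x_{1:n}) := \{ s \in \mathcal{X} : c(x_{1:n} s) > 0 \}$ and $\alpha_i := \frac{1}{i}$ for $i \in \mathbb{N}$.

In the above, $U(x_{1:n})$ is simply the set of distinct symbols occurring in $x_{1:n}$, so $|U(x_{1:n})|$ is
the number of distinct symbols seen so far. Furthermore, one can easily verify that $\xi$ is a valid
probability measure over finite but arbitrarily large strings whose symbols are from the alphabet $\mathcal{X}$.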
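The following Python sketch evaluates Equation 6 directly and sums the result over every string of a fixed length, as a small numerical sanity check of the normalisation. It assumes $\alpha_i = 1/i$ as in Definition 1 and only considers strings no longer than the alphabet size, so that an unseen symbol is always available. The function names are illustrative assumptions.

```python
import itertools


def ssd_prob(seq, alphabet_size):
    """xi(x_{1:n}) of Equation 6 with alpha_i = 1/i."""
    counts = {}
    prob = 1.0
    for i, sym in enumerate(seq, start=1):
        alpha = 1.0 / i
        seen = len(counts)       # |U(x_{<i})|
        c = counts.get(sym, 0)   # c(x_{1:i})
        if c == 0:
            # New symbol: mass alpha_i spread uniformly over unseen symbols.
            prob *= alpha / (alphabet_size - seen)
        else:
            # Seen symbol: mass (1 - alpha_i) shared among the seen symbols.
            prob *= (1 - alpha) * (c + 0.5) / (i + seen / 2 - 1)
        counts[sym] = c + 1
    return prob


def total_mass(alphabet, n):
    """Sum of xi over all strings of length n drawn from alphabet."""
    return sum(ssd_prob(s, len(alphabet))
               for s in itertools.product(alphabet, repeat=n))


if __name__ == "__main__":
    print(total_mass("abcd", 3))
```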

Computational Properties. Given a sequence $x_{1:n} \in \mathcal{A}^n$, $\xi(x_{1:n})$ can be computed in
$O(n)$ time, with $O(|\mathcal{A}|)$ space required to store the counts for the seen symbols. Furthermore,
by using $\xi(x_n | x_{<n}) = \xi(x_{1:n}) / \xi(x_{<n})$ in combination with the chain rule
$\xi(x_{1:n}) = \xi(x_n | x_{<n}) \, \xi(x_{<n})$, each symbol $x_{n+1}$ can be processed in $O(1)$ time, leading to a
straightforward incremental algorithm. As usual, numerical underflow issues can be addressed by
storing all probability values in log-space.
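One way the incremental algorithm could be organised is sketched below in Python. It keeps a dictionary of counts for the seen symbols only and accumulates $\log_2 \xi(x_{1:n})$, so each update costs expected constant time under the usual hash-table assumptions. The class and method names are hypothetical and this is not claimed to be the authors' implementation.

```python
import math


class SparseSequentialDirichlet:
    """Incremental, log-space evaluation of Equation 6 with alpha_i = 1/i."""

    def __init__(self, alphabet_size):
        self.alphabet_size = alphabet_size
        self.counts = {}      # counts for seen symbols only
        self.n = 0            # number of symbols processed so far
        self.log2_prob = 0.0  # log2 xi(x_{1:n})

    def log2_conditional(self, sym):
        """log2 xi(sym | x_{1:n}) for the next symbol."""
        i = self.n + 1
        alpha = 1.0 / i
        seen = len(self.counts)
        c = self.counts.get(sym, 0)
        if c == 0:
            return math.log2(alpha) - math.log2(self.alphabet_size - seen)
        return (math.log2(1 - alpha) + math.log2(c + 0.5)
                - math.log2(i + seen / 2 - 1))

    def update(self, sym):
        """Process one symbol and return its cost in bits."""
        log2_p = self.log2_conditional(sym)
        self.log2_prob += log2_p
        self.counts[sym] = self.counts.get(sym, 0) + 1
        self.n += 1
        return -log2_p


if __name__ == "__main__":
    model = SparseSequentialDirichlet(alphabet_size=256)
    total_bits = sum(model.update(s) for s in "abracadabra")
    print(total_bits, len(model.counts))
```

The dictionary never holds more entries than there are distinct symbols in the data, which mirrors the $O(|\mathcal{A}|)$ space requirement stated above.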
Analysis. We now show that Sparse Sequential Dirichlet Coding using an alphabet $\mathcal{X}$
performs well provided there exists an alphabet $\mathcal{A} \subseteq \mathcal{X}$ for which Sequential Dirichlet Coding
performs well. Our goal will be to provide a redundancy bound which does not exhibit a
linear dependence on $|\mathcal{X}|$.

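Theorem 1. Given alphabets $\mathcal{X}$ and $\mathcal{A}$ such that $\mathcal{A} \subseteq \mathcal{X}$, for all $n \in \mathbb{N}$, for all $x_{1:n} \in \mathcal{A}^n$,
we have $-\log_2 \xi(x_{1:n}) \le \log_2 n + |\mathcal{A}| \log_2 |\mathcal{X}| - \log_2 \rho_{\mathcal{A}}(x_{1:n})$.

Proof. First note that since $|\mathcal{X}| \ge |\mathcal{X}| - |U(x_{<i})|$ and $|U(x_{1:n})| \le |\mathcal{A}|$ for all $x_{1:n} \in \mathcal{A}^n$,

$$\xi(x_{1:n}) \ge \prod_{i=1}^{n} \left( \mathbb{I}[c(x_{1:i}) = 0] \, \frac{\alpha_i}{|\mathcal{X}|} + \mathbb{I}[c(x_{1:i}) > 0] \, (1 - \alpha_i) \, \frac{c(x_{1:i}) + \frac{1}{2}}{i + \frac{|\mathcal{A}|}{2} - 1} \right).$$

Now, noting that $\alpha_i = \frac{1}{i} \ge \frac{1}{2} / (i + |\mathcal{A}|/2 - 1)$ for all $i \in \mathbb{N}$ (which holds because $i + |\mathcal{A}| \ge 2$), we get

$$\xi(x_{1:n}) \ge \prod_{i=1}^{n} \left( \mathbb{I}[c(x_{1:i}) = 0] \, \frac{1}{|\mathcal{X}|} + \mathbb{I}[c(x_{1:i}) > 0] \, (1 - \alpha_i) \right) \cdot \frac{c(x_{1:i}) + \frac{1}{2}}{i + \frac{|\mathcal{A}|}{2} - 1}.$$

Since there can be at most $|\mathcal{A}|$ new symbols, with the first symbol always being new, and

$$\prod_{1 \le i \le n \, : \, \mathbb{I}[c(x_{1:i}) > 0]} (1 - \alpha_i) \ge \prod_{i=2}^{n} (1 - \alpha_i),$$

we can conclude

$$\xi(x_{1:n}) \ge |\mathcal{X}|^{-|\mathcal{A}|} \prod_{i=2}^{n} \frac{i - 1}{i} \prod_{i=1}^{n} \frac{c(x_{1:i}) + \frac{1}{2}}{i + \frac{|\mathcal{A}|}{2} - 1}. \tag{7}$$

Now, simplifying the telescoping product and applying the definition of $\rho_{\mathcal{A}}$ (see Equation 1)
to the right-hand side of Equation 7 gives $n^{-1} |\mathcal{X}|^{-|\mathcal{A}|} \rho_{\mathcal{A}}(x_{1:n})$. Hence,

$$-\log_2 \xi(x_{1:n}) \le -\log_2 \left( n^{-1} |\mathcal{X}|^{-|\mathcal{A}|} \rho_{\mathcal{A}}(x_{1:n}) \right) = \log_2 n + |\mathcal{A}| \log_2 |\mathcal{X}| - \log_2 \rho_{\mathcal{A}}(x_{1:n}).$$

□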
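The bound of Theorem 1 can also be checked numerically on small cases. The Python sketch below enumerates every non-empty string over a sub-alphabet up to a chosen length and reports the smallest gap between the two sides of the inequality. It assumes the forms of Equation 1 and Equation 6 with $\alpha_i = 1/i$; the function names and the test sizes are illustrative assumptions.

```python
import itertools
import math


def sdc_log2(seq, size):
    """log2 of the Sequential Dirichlet probability (Equation 1)."""
    counts, total = {}, 0.0
    for i, s in enumerate(seq, start=1):
        c = counts.get(s, 0)
        total += math.log2((c + 0.5) / (i + size / 2 - 1))
        counts[s] = c + 1
    return total


def ssd_log2(seq, size):
    """log2 of the Sparse Sequential Dirichlet probability (Equation 6)."""
    counts, total = {}, 0.0
    for i, s in enumerate(seq, start=1):
        c, seen, alpha = counts.get(s, 0), len(counts), 1.0 / i
        if c == 0:
            p = alpha / (size - seen)
        else:
            p = (1 - alpha) * (c + 0.5) / (i + seen / 2 - 1)
        total += math.log2(p)
        counts[s] = c + 1
    return total


def theorem1_gap(seq, sub_size, size):
    """Right-hand side minus left-hand side of Theorem 1."""
    rhs = (math.log2(len(seq)) + sub_size * math.log2(size)
           - sdc_log2(seq, sub_size))
    return rhs + ssd_log2(seq, size)


def min_gap(sub_alphabet, size, max_len):
    """Smallest gap over all non-empty strings up to length max_len."""
    return min(theorem1_gap(seq, len(sub_alphabet), size)
               for n in range(1, max_len + 1)
               for seq in itertools.product(sub_alphabet, repeat=n))


if __name__ == "__main__":
    print(min_gap("ab", 8, 6))
```

A non-negative result is consistent with the theorem on the cases enumerated; it is a sanity check, not a substitute for the proof.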

Thus, combining Theorem 1, Equation 5 and Equation 3, the overall coding redundancy
of the Sparse Sequential Dirichlet Distribution is upper bounded by

$$\frac{|\mathcal{A}| + 1}{2} \log_2 n + |\mathcal{A}| \log_2 |\mathcal{X}| + |\mathcal{A}| + 1. \tag{8}$$

Discussion. A comparison of Equation 8 to Equation 2 suggests that the redundancy of
Sparse Sequential Dirichlet Coding will be less than that of Sequential Dirichlet Coding when $|\mathcal{A}|$
is much smaller than $|\mathcal{X}|$. Furthermore, by applying the inequalities

$$|\mathcal{A}| \log_2 \frac{|\mathcal{X}|}{|\mathcal{A}|} \le \log_2 \binom{|\mathcal{X}|}{|\mathcal{A}|} \le |\mathcal{A}| \log_2 \frac{e |\mathcal{X}|}{|\mathcal{A}|}$$

to bound Equation 4, we can see that our redundancy bound for Sparse Sequential Dirichlet
Coding is competitive with the redundancy bound for the Sequential Sub-Alphabet Estimator
whenever $|\mathcal{X}|$ is much larger than $|\mathcal{A}|$, and worse when $|\mathcal{A}|$ is close to $|\mathcal{X}|$.
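To give a feel for the comparison, the Python sketch below evaluates the three redundancy bounds for given alphabet sizes and sequence length. It assumes that Equation 4 has the form $\log_2 |\mathcal{X}| + \log_2 \binom{|\mathcal{X}|}{|\mathcal{A}|} + \frac{|\mathcal{A}|-1}{2} \log_2 n + |\mathcal{A}| + 1$; the function names are hypothetical.

```python
import math


def sdc_bound(n, x):
    """Equation 2: Sequential Dirichlet Coding with the full alphabet.
    Note this excludes the additive 2 bits of Equation 5."""
    return (x - 1) / 2 * math.log2(n) + x - 1


def ssd_bound(n, a, x):
    """Equation 8: Sparse Sequential Dirichlet Coding."""
    return (a + 1) / 2 * math.log2(n) + a * math.log2(x) + a + 1


def ssa_bound(n, a, x):
    """Equation 4 (assumed form): Sequential Sub-Alphabet Estimator."""
    return (math.log2(x) + math.log2(math.comb(x, a))
            + (a - 1) / 2 * math.log2(n) + a + 1)


def binomial_bounds(a, x):
    """Lower bound, exact value and upper bound for log2 C(x, a)."""
    lower = a * math.log2(x / a)
    exact = math.log2(math.comb(x, a))
    upper = a * math.log2(math.e * x / a)
    return lower, exact, upper


if __name__ == "__main__":
    for a, x in [(5, 26), (10, 256), (18, 26)]:
        print(a, x, sdc_bound(100, x), ssd_bound(100, a, x), ssa_bound(100, a, x))
```

The three parameter pairs in the usage example match the alphabet sizes used in the experiments of Section 4, with $n = 100$.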
Method       Mean        Min         Max
ORACLE       185.048     12.3267     244.107
SDC(A)       193.953     21.368      243.718
SDC(X)       236.343     63.7581     286.108
SSD          210.844     23.7755     262.4
SSA          212.257     24.4074     262.928
SSD - SSA    -1.41272    -5.77022    0.366856

Table 1: Number of bits needed to encode 100 symbols when $|\mathcal{A}| = 5$ and $|\mathcal{X}| = 26$.



4 Numerical Experiments 

We now present some numerical results for Sparse Sequential Dirichlet Coding, comparing
it against related techniques using the experimental framework described below.



Experimental Setup. Each experiment consisted of evaluating the performance
of 5 different coding distributions on synthetically generated data, for various choices of $\mathcal{A}$
and $\mathcal{X}$. The first technique, ORACLE, used the true underlying data generating distribution
to code the data. This is of course the optimal coding distribution in expectation, and
a natural baseline. The second and third techniques, SDC($\mathcal{A}$) and SDC($\mathcal{X}$), refer to
Sequential Dirichlet Coding using the alphabets $\mathcal{A}$ and $\mathcal{X}$ respectively. These two methods
allow us to measure the impact of knowing and not knowing $\mathcal{A}$ in advance. SSD refers
to our Sparse Sequential Dirichlet Coding technique. Finally, SSA refers to the Sequential
Sub-Alphabet technique of Tjalkens et al. [1993].

To evaluate each particular combination of $\mathcal{A}$ and $\mathcal{X}$, 100,000 parameter vectors, $\alpha_i \in \mathbb{R}^{|\mathcal{A}|}$
for $1 \le i \le 100{,}000$, were sampled from a Symmetric Dirichlet Distribution using a
concentration parameter of 1.0. These $\alpha_i$ were used to define a set of Categorical Distributions
over the symbols in $\mathcal{A}$. Each Categorical Distribution was used once to generate a data
sequence of 100 independent and identically distributed random symbols, which were then
coded using each of the methods. The mean, min and max performance, measured in bits,
for each coding distribution on the generated data sequences is summarised
in Tables 1, 2 and 3. Additionally, the last line of each table reports how many extra bits
Sparse Sequential Dirichlet Coding needed compared with Sequential Sub-Alphabet Coding.
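A scaled-down version of this setup can be sketched in Python using only the standard library. The sketch below makes several assumptions that are not part of the original experiments: it omits the SSA method, reports only the mean, uses far fewer trials by default, fixes a random seed, and draws the Symmetric Dirichlet sample by normalising independent Gamma(1, 1) variates. All function names are hypothetical, and the numbers it produces are not expected to reproduce the tables exactly.

```python
import math
import random


def sdc_bits(seq, size):
    """Ideal code length under Sequential Dirichlet Coding (Equation 1)."""
    counts, bits = {}, 0.0
    for i, s in enumerate(seq, start=1):
        c = counts.get(s, 0)
        bits -= math.log2((c + 0.5) / (i + size / 2 - 1))
        counts[s] = c + 1
    return bits


def ssd_bits(seq, size):
    """Ideal code length under Sparse Sequential Dirichlet Coding."""
    counts, bits = {}, 0.0
    for i, s in enumerate(seq, start=1):
        c, seen, alpha = counts.get(s, 0), len(counts), 1.0 / i
        if c == 0:
            p = alpha / (size - seen)
        else:
            p = (1 - alpha) * (c + 0.5) / (i + seen / 2 - 1)
        bits -= math.log2(p)
        counts[s] = c + 1
    return bits


def run_trial(rng, a, x, n):
    """Sample one categorical source over a symbols and code n symbols."""
    gammas = [rng.gammavariate(1.0, 1.0) for _ in range(a)]
    total = sum(gammas)
    theta = [g / total for g in gammas]  # Symmetric Dirichlet(1.0) sample
    seq = rng.choices(range(a), weights=theta, k=n)
    oracle = -sum(math.log2(theta[s]) for s in seq)
    return {"ORACLE": oracle, "SDC(A)": sdc_bits(seq, a),
            "SDC(X)": sdc_bits(seq, x), "SSD": ssd_bits(seq, x)}


def run_experiment(a, x, n=100, trials=1000, seed=0):
    """Mean code length in bits of each method over the sampled sources."""
    rng = random.Random(seed)
    results = [run_trial(rng, a, x, n) for _ in range(trials)]
    return {m: sum(r[m] for r in results) / trials for m in results[0]}


if __name__ == "__main__":
    for method, mean_bits in run_experiment(a=5, x=26).items():
        print(method, round(mean_bits, 3))
```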



Results. Table 1 and Table 2 compare the relative coding performance of Sparse Sequential
Dirichlet Coding when $|\mathcal{A}|$ is much less than $|\mathcal{X}|$. In both situations we see that the Sparse
Sequential Dirichlet technique is on average slightly superior to the Sequential Sub-Alphabet
method, and never worse by more than 1.32 bits. Both techniques performed significantly
better than the Sequential Dirichlet Coding method which used the alphabet $\mathcal{X}$. This
is consistent with the redundancy bounds we presented earlier. Lastly, Table 3 gives an
example of what can happen when the sparsity assumption doesn't apply. Here we see that
the Sparse Sequential Dirichlet method is outperformed by all other techniques, though not
by a large margin.

Method       Mean        Min         Max
ORACLE       278.363     131.529     340.359
SDC(A)       293.969     146.716     349.882
SDC(X)       492.284     345.031     548.197
SSD          349.169     181.365     412.766
SSA          350.473     187.604     411.656
SSD - SSA    -1.30374    -7.4791     1.3234

Table 2: Number of bits needed to encode 100 symbols when $|\mathcal{A}| = 10$ and $|\mathcal{X}| = 256$.

Method       Mean        Min         Max
ORACLE       360.053     234.325     422.467
SDC(A)       382.911     258.392     440.005
SDC(X)       396.527     272.007     453.62
SSD          410.573     277.942     476.754
SSA          397.344     271.68      456.234
SSD - SSA    13.2292     0.446927    20.5248

Table 3: Number of bits needed to encode 100 symbols when $|\mathcal{A}| = 18$ and $|\mathcal{X}| = 26$.

Discussion. In light of its superior computational properties, our results suggest that the
Sparse Sequential Dirichlet technique is a good alternative to the Sequential Sub-Alphabet
method whenever $|\mathcal{A}|$ is much less than $|\mathcal{X}|$. If computation time and memory
are not a concern, the Sequential Sub-Alphabet method is to be preferred due to its better
performance when $|\mathcal{A}|$ is not much less than $|\mathcal{X}|$.

5 Conclusion 

This short paper has described a simple and efficient coding technique for multi-alphabet 
memoryless sources. It provably works well when only a small subset of possible alphabet 
symbols are expected to occur in any given data sequence. As future work, it would be 
interesting to explore the applicability of this technique as a building block within more 
sophisticated context modeling techniques. 

Acknowledgements The authors would like to thank Kee Siong Ng and Marc Bellemare 
for comments that helped improve this paper. 